FIELD OF THE INVENTION

The disclosed embodiments relate to systems and methods for detecting and 
processing a desired signal in the presence of acoustic noise. 

BACKGROUND

Many noise suppression algorithms and techniques have been developed over the
years. Most of the noise suppression systems in use today for speech communication
systems are based on a single-microphone spectral subtraction technique first developed in
the 1970's and described, for example, by S. F. Boll in "Suppression of Acoustic Noise in
Speech using Spectral Subtraction," IEEE Trans. on ASSP, pp. 113-120, 1979. These
techniques have been refined over the years, but the basic principles of operation have
remained the same. See, for example, United States Patent Number 5,687,243 of
McLaughlin, et al., and United States Patent Number 4,811,404 of Vilmur, et al.
Generally, these techniques make use of a microphone-based Voice Activity Detector
(VAD) to determine the background noise characteristics, where "voice" is generally
understood to include human voiced speech, unvoiced speech, or a combination of voiced
and unvoiced speech.

The VAD has also been used in digital cellular systems. As an example of such a
use, see United States Patent Number 6,453,291 of Ashley, where a VAD configuration
appropriate to the front-end of a digital cellular system is described. Further, some Code
Division Multiple Access (CDMA) systems utilize a VAD to minimize the effective radio
spectrum used, thereby allowing for more system capacity. Also, Global System for
Mobile Communication (GSM) systems can include a VAD to reduce co-channel
interference and to reduce battery consumption on the client or subscriber device.

These typical microphone-based VAD systems are significantly limited in
capability as a result of the addition of environmental acoustic noise to the desired speech
signal received by the single microphone, wherein the analysis is performed using typical
signal processing techniques. In particular, limitations in performance of these
microphone-based VAD systems are noted when processing signals having a low
signal-to-noise ratio (SNR), and in settings where the background noise varies quickly. Thus,
similar limitations are found in noise suppression systems using these microphone-based
VADs.

BRIEF DESCRIPTION OF THE FIGURES

Figure 1 is a block diagram of a denoising system, under an embodiment. 

Figure 2 is a block diagram including components of a noise removal algorithm,
under the denoising system of an embodiment assuming a single noise source and direct
paths to the microphones.

Figure 3 is a block diagram including front-end components of a noise removal 
algorithm of an embodiment generalized to n distinct noise sources (these noise sources 
may be reflections or echoes of one another). 

Figure 4 is a block diagram including front-end components of a noise removal
algorithm of an embodiment in a general case where there are n distinct noise sources and
signal reflections.

Figure 5 is a flow diagram of a denoising method, under an embodiment. 

Figure 6 shows results of a noise suppression algorithm of an embodiment for an
American English female speaker in the presence of airport terminal noise that includes
many other human speakers and public announcements.

Figure 7A is a block diagram of a Voice Activity Detector (VAD) system 
including hardware for use in receiving and processing signals relating to VAD, under an 
embodiment. 

Figure 7B is a block diagram of a VAD system using hardware of a coupled noise
suppression system for use in receiving VAD information, under an alternative
embodiment.

Figure 8 is a flow diagram of a method for determining voiced and unvoiced 
speech using an accelerometer-based VAD, under an embodiment. 

Figure 9 shows plots including a noisy audio signal (live recording) along with a
corresponding accelerometer-based VAD signal, the corresponding accelerometer output
signal, and the denoised audio signal following processing by the noise suppression
system using the VAD signal, under an embodiment.

Figure 10 shows plots including a noisy audio signal (live recording) along with a
corresponding SSM-based VAD signal, the corresponding SSM output signal, and the
denoised audio signal following processing by the noise suppression system using the
VAD signal, under an embodiment.

Figure 11 shows plots including a noisy audio signal (live recording) along with a
corresponding GEMS-based VAD signal, the corresponding GEMS output signal, and the
denoised audio signal following processing by the noise suppression system using the
VAD signal, under an embodiment.

DETAILED DESCRIPTION

The following description provides specific details for a thorough understanding
of, and enabling description for, embodiments of the noise suppression system.
However, one skilled in the art will understand that the invention may be practiced
without these details. In other instances, well-known structures and functions have not
been shown or described in detail to avoid unnecessarily obscuring the description of the
embodiments of the noise suppression system. In the following description, "signal"
represents any acoustic signal (such as human speech) that is desired, and "noise" is any
acoustic signal (which may include human speech) that is not desired. An example
would be a person talking on a cellular telephone with a radio in the background. The
person's speech is desired and the acoustic energy from the radio is not desired. In
addition, "user" describes a person who is using the device and whose speech is desired
to be captured by the system.

Also, "acoustic" is generally defined as acoustic waves propagating in air.
Propagation of acoustic waves in media other than air will be noted as such. References
to "speech" or "voice" generally refer to human speech including voiced speech,
unvoiced speech, and/or a combination of voiced and unvoiced speech. Unvoiced speech
or voiced speech is distinguished where necessary. The term "noise suppression"
generally describes any method by which noise is reduced or eliminated in an electronic
signal.

Moreover, the term "VAD" is generally defined as a vector or array signal, data,
or information that in some manner represents the occurrence of speech in the digital or
analog domain. A common representation of VAD information is a one-bit digital signal
sampled at the same rate as the corresponding acoustic signals, with a zero value
representing that no speech has occurred during the corresponding time sample, and a
unity value indicating that speech has occurred during the corresponding time sample.
While the embodiments described herein are generally described in the digital domain,
the descriptions are also valid for the analog domain.

Figure 1 is a block diagram of a denoising system 1000 of an embodiment that
uses knowledge of when speech is occurring derived from physiological information on
voicing activity. The system 1000 includes microphones 10 and sensors 20 that provide
signals to at least one processor 30. The processor includes a denoising subsystem or
algorithm 40.

Figure 2 is a block diagram including components of a noise removal algorithm
200 of an embodiment. A single noise source and a direct path to the microphones are
assumed. An operational description of the noise removal algorithm 200 of an
embodiment is provided using a single signal source 100 and a single noise source 101,
but is not so limited. This algorithm 200 uses two microphones: a "signal" microphone 1
("MIC 1") and a "noise" microphone 2 ("MIC 2"), but is not so limited. The signal
microphone MIC 1 is assumed to capture mostly signal with some noise, while MIC 2
captures mostly noise with some signal. The data from the signal source 100 to MIC 1 is
denoted by $s(n)$, where $s(n)$ is a discrete sample of the analog signal from the source 100.
The data from the signal source 100 to MIC 2 is denoted by $s_2(n)$. The data from the
noise source 101 to MIC 2 is denoted by $n(n)$. The data from the noise source 101 to
MIC 1 is denoted by $n_2(n)$. Similarly, the data from MIC 1 to noise removal element 205
is denoted by $m_1(n)$, and the data from MIC 2 to noise removal element 205 is denoted by
$m_2(n)$.

The noise removal element 205 also receives a signal from a voice activity
detection (VAD) element 204. The VAD 204 uses physiological information to
determine when a speaker is speaking. In various embodiments, the VAD can include at
least one of an accelerometer, a skin surface microphone in physical contact with skin of
a user, a human tissue vibration detector, a radio frequency (RF) vibration and/or motion
detector/device, an electroglottograph, an ultrasound device, an acoustic microphone that
is being used to detect acoustic frequency signals that correspond to the user's speech
directly from the skin of the user (anywhere on the body), an airflow detector, and a laser
vibration detector.

The transfer functions from the signal source 100 to MIC 1 and from the noise
source 101 to MIC 2 are assumed to be unity. The transfer function from the signal
source 100 to MIC 2 is denoted by $H_2(z)$, and the transfer function from the noise source
101 to MIC 1 is denoted by $H_1(z)$. The assumption of unity transfer functions does not
inhibit the generality of this algorithm, as the actual relations between the signal, noise,
and microphones are simply ratios and the ratios are redefined in this manner for
simplicity.

In conventional two-microphone noise removal systems, the information from
MIC 2 is used to attempt to remove noise from MIC 1. However, a (generally
unspoken) assumption is that the VAD element 204 is never perfect, and thus the
denoising must be performed cautiously, so as not to remove too much of the signal along
with the noise. However, if the VAD 204 is assumed to be perfect such that it is equal to
zero when there is no speech being produced by the user, and equal to one when speech is
produced, a substantial improvement in the noise removal can be made.
In analyzing the single noise source 101 and the direct path to the microphones,
with reference to Figure 2, the total acoustic information coming into MIC 1 is denoted
by $m_1(n)$. The total acoustic information coming into MIC 2 is similarly labeled $m_2(n)$.
In the z (digital frequency) domain, these are represented as $M_1(z)$ and $M_2(z)$. Then,

$$
\begin{aligned}
M_1(z) &= S(z) + N_2(z) \\
M_2(z) &= N(z) + S_2(z)
\end{aligned}
$$

with

$$
\begin{aligned}
N_2(z) &= N(z)H_1(z) \\
S_2(z) &= S(z)H_2(z),
\end{aligned}
$$

so that

$$
\begin{aligned}
M_1(z) &= S(z) + N(z)H_1(z) \\
M_2(z) &= N(z) + S(z)H_2(z).
\end{aligned}
\qquad \text{Eq. 1}
$$

This is the general case for all two microphone systems. In a practical system
there is always going to be some leakage of noise into MIC 1, and some leakage of signal
into MIC 2. Equation 1 has four unknowns and only two known relationships and
therefore cannot be solved explicitly.

However, there is another way to solve for some of the unknowns in Equation 1.
The analysis starts with an examination of the case where the signal is not being
generated, that is, where a signal from the VAD element 204 equals zero and speech is
not being produced. In this case, $s(n) = S(z) = 0$, and Equation 1 reduces to

$$
\begin{aligned}
M_{1n}(z) &= N(z)H_1(z) \\
M_{2n}(z) &= N(z),
\end{aligned}
$$

where the n subscript on the M variables indicates that only noise is being received. This
leads to

$$
\begin{aligned}
M_{1n}(z) &= M_{2n}(z)H_1(z) \\
H_1(z) &= \frac{M_{1n}(z)}{M_{2n}(z)}.
\end{aligned}
\qquad \text{Eq. 2}
$$



The function $H_1(z)$ can be calculated using any of the available system
identification algorithms and the microphone outputs when the system is certain that only
noise is being received. The calculation can be done adaptively, so that the system can
react to changes in the noise.

A solution is now available for one of the unknowns in Equation 1. Another
unknown, $H_2(z)$, can be determined by using the instances where the VAD equals one and
speech is being produced. When this is occurring, but the recent (perhaps less than 1
second) history of the microphones indicates low levels of noise, it can be assumed that
$n(n) = N(z) \approx 0$. Then Equation 1 reduces to

$$
\begin{aligned}
M_{1s}(z) &= S(z) \\
M_{2s}(z) &= S(z)H_2(z),
\end{aligned}
$$

which in turn leads to

$$
\begin{aligned}
M_{2s}(z) &= M_{1s}(z)H_2(z) \\
H_2(z) &= \frac{M_{2s}(z)}{M_{1s}(z)},
\end{aligned}
$$

which is the inverse of the $H_1(z)$ calculation. However, it is noted that different inputs are
being used (now only the signal is occurring whereas before only the noise was
occurring). While calculating $H_2(z)$, the values calculated for $H_1(z)$ are held constant and
vice versa. Thus, it is assumed that while one of $H_1(z)$ and $H_2(z)$ is being calculated, the
one not being calculated does not change substantially.

After calculating $H_1(z)$ and $H_2(z)$, they are used to remove the noise from the
signal. If Equation 1 is rewritten as

$$
\begin{aligned}
S(z) &= M_1(z) - N(z)H_1(z) \\
N(z) &= M_2(z) - S(z)H_2(z) \\
S(z) &= M_1(z) - \left[M_2(z) - S(z)H_2(z)\right]H_1(z) \\
S(z)\left[1 - H_2(z)H_1(z)\right] &= M_1(z) - M_2(z)H_1(z),
\end{aligned}
$$

then $N(z)$ may be substituted as shown to solve for $S(z)$ as

$$
S(z) = \frac{M_1(z) - M_2(z)H_1(z)}{1 - H_2(z)H_1(z)}. \qquad \text{Eq. 3}
$$

If the transfer functions $H_1(z)$ and $H_2(z)$ can be described with sufficient accuracy,
then the noise can be completely removed and the original signal recovered. This
remains true without respect to the amplitude or spectral characteristics of the noise. The
only assumptions made include use of a perfect VAD, sufficiently accurate $H_1(z)$ and
$H_2(z)$, and that when one of $H_1(z)$ and $H_2(z)$ is being calculated the other does not
change substantially. In practice these assumptions have proven reasonable.
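The relationships in Equations 1 through 3 can be checked numerically. The following is a minimal sketch, not the implementation of any embodiment: it assumes short FIR leakage paths with made-up tap values (`h1`, `h2`), uses circular convolution so that the z-domain products hold exactly at each FFT bin, and relies on a perfect VAD by construction (the noise-only and speech-only frames are synthesized separately).

```python
import numpy as np

rng = np.random.default_rng(0)
L = 256                      # frame length in samples (illustrative)

def circ_filter(x, h):
    """Circularly convolve x with FIR taps h, so Y(z) = X(z)H(z) holds per FFT bin."""
    return np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(h, n=len(x)), n=len(x))

# Hypothetical leakage paths: noise -> MIC 1 (h1) and signal -> MIC 2 (h2).
h1 = np.array([0.5, 0.2, -0.1])
h2 = np.array([0.3, -0.1])

def mics(s, n):
    """Equation 1 in the time domain: returns (m1, m2)."""
    return s + circ_filter(n, h1), n + circ_filter(s, h2)

# VAD = 0: only noise is present, so H1 = M1n / M2n (Equation 2).
m1n, m2n = mics(np.zeros(L), rng.standard_normal(L))
H1 = np.fft.rfft(m1n) / np.fft.rfft(m2n)

# VAD = 1 with negligible noise: H2 = M2s / M1s.
m1s, m2s = mics(rng.standard_normal(L), np.zeros(L))
H2 = np.fft.rfft(m2s) / np.fft.rfft(m1s)

# Speech and noise together: recover S with Equation 3.
s = rng.standard_normal(L)
n = 3.0 * rng.standard_normal(L)      # noise louder than the signal
m1, m2 = mics(s, n)
M1, M2 = np.fft.rfft(m1), np.fft.rfft(m2)
S_hat = (M1 - M2 * H1) / (1.0 - H2 * H1)
s_hat = np.fft.irfft(S_hat, n=L)

max_error = np.max(np.abs(s_hat - s))  # small when H1 and H2 are accurate
```

In this idealized setting the recovered signal matches the original to within rounding error regardless of the noise amplitude, which mirrors the statement that the result does not depend on the amplitude or spectral characteristics of the noise.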

The noise removal algorithm described herein is easily generalized to include any
number of noise sources. Figure 3 is a block diagram including front-end components
300 of a noise removal algorithm of an embodiment, generalized to n distinct noise
sources. These distinct noise sources may be reflections or echoes of one another, but are
not so limited. There are several noise sources shown, each with a transfer function, or
path, to each microphone. The previously named path $H_2$ has been relabeled as $H_0$, so
that labeling noise source 2's path to MIC 1 is more convenient. The outputs of each
microphone, when transformed to the z domain, are:

$$
\begin{aligned}
M_1(z) &= S(z) + N_1(z)H_1(z) + N_2(z)H_2(z) + \dots + N_n(z)H_n(z) \\
M_2(z) &= S(z)H_0(z) + N_1(z)G_1(z) + N_2(z)G_2(z) + \dots + N_n(z)G_n(z).
\end{aligned}
\qquad \text{Eq. 4}
$$
When there is no signal (VAD = 0), then (suppressing z for clarity)

$$
\begin{aligned}
M_{1n} &= N_1H_1 + N_2H_2 + \dots + N_nH_n \\
M_{2n} &= N_1G_1 + N_2G_2 + \dots + N_nG_n.
\end{aligned}
\qquad \text{Eq. 5}
$$

A new transfer function can now be defined as

$$
\tilde{H}_1 = \frac{M_{1n}}{M_{2n}} = \frac{N_1H_1 + N_2H_2 + \dots + N_nH_n}{N_1G_1 + N_2G_2 + \dots + N_nG_n}, \qquad \text{Eq. 6}
$$

where $\tilde{H}_1$ is analogous to $H_1(z)$ above. Thus $\tilde{H}_1$ depends only on the noise sources and
their respective transfer functions and can be calculated any time there is no signal being
transmitted. Once again, the "n" subscripts on the microphone inputs denote only that
noise is being detected, while an "s" subscript denotes that only signal is being received
by the microphones.

Examining Equation 4 while assuming an absence of noise produces

$$
\begin{aligned}
M_{1s} &= S \\
M_{2s} &= SH_0.
\end{aligned}
$$

Thus, $H_0$ can be solved for as before, using any available transfer function calculating
algorithm. Mathematically, then,

$$
H_0 = \frac{M_{2s}}{M_{1s}}.
$$

Rewriting Equation 4, using $\tilde{H}_1$ defined in Equation 6, provides,

$$
\tilde{H}_1 = \frac{M_1 - S}{M_2 - SH_0}. \qquad \text{Eq. 7}
$$

Solving for S yields,

$$
S = \frac{M_1 - M_2\tilde{H}_1}{1 - H_0\tilde{H}_1}, \qquad \text{Eq. 8}
$$

which is the same as Equation 3, with $H_0$ taking the place of $H_2$, and $\tilde{H}_1$ taking the place
of $H_1$. Thus the noise removal algorithm still is mathematically valid for any number of
noise sources, including multiple echoes of noise sources. Again, if $H_0$ and $\tilde{H}_1$ can be
estimated to a high enough accuracy, and the above assumption of only one path from the
signal to the microphones holds, the noise may be removed completely.

The most general case involves multiple noise sources and multiple signal
sources. Figure 4 is a block diagram including front-end components 400 of a noise
removal algorithm of an embodiment in the most general case where there are n distinct
noise sources and signal reflections. Here, signal reflections enter both microphones MIC
1 and MIC 2. This is the most general case, as reflections of the noise source into the
microphones MIC 1 and MIC 2 can be modeled accurately as simple additional noise
sources. For clarity, the direct path from the signal to MIC 2 is changed from $H_0(z)$ to
$H_{00}(z)$, and the reflected paths to MIC 1 and MIC 2 are denoted by $H_{01}(z)$ and $H_{02}(z)$,
respectively.

The input into the microphones now becomes

$$
\begin{aligned}
M_1(z) &= S(z) + S(z)H_{01}(z) + N_1(z)H_1(z) + N_2(z)H_2(z) + \dots + N_n(z)H_n(z) \\
M_2(z) &= S(z)H_{00}(z) + S(z)H_{02}(z) + N_1(z)G_1(z) + N_2(z)G_2(z) + \dots + N_n(z)G_n(z).
\end{aligned}
\qquad \text{Eq. 9}
$$

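When the VAD = 0, the inputs become (suppressing z again)

$$
\begin{aligned}
M_{1n} &= N_1H_1 + N_2H_2 + \dots + N_nH_n \\
M_{2n} &= N_1G_1 + N_2G_2 + \dots + N_nG_n,
\end{aligned}
$$

which is the same as Equation 5. Thus, the calculation of $\tilde{H}_1$ in Equation 6 is unchanged,
as expected. In examining the situation where there is no noise, Equation 9 reduces to

$$
\begin{aligned}
M_{1s} &= S + SH_{01} \\
M_{2s} &= SH_{00} + SH_{02}.
\end{aligned}
$$

This leads to the definition of $\tilde{H}_2$ as

$$
\tilde{H}_2 = \frac{M_{2s}}{M_{1s}} = \frac{H_{00} + H_{02}}{1 + H_{01}}. \qquad \text{Eq. 10}
$$

Rewriting Equation 9 again using the definition for $\tilde{H}_1$ (as in Equation 7)
provides

$$
\tilde{H}_1 = \frac{M_1 - S(1 + H_{01})}{M_2 - S(H_{00} + H_{02})}. \qquad \text{Eq. 11}
$$

Some algebraic manipulation yields

$$
\begin{aligned}
S\left(1 + H_{01} - \tilde{H}_1(H_{00} + H_{02})\right) &= M_1 - M_2\tilde{H}_1 \\
S(1 + H_{01})\left[1 - \tilde{H}_1\frac{H_{00} + H_{02}}{1 + H_{01}}\right] &= M_1 - M_2\tilde{H}_1,
\end{aligned}
$$

and finally

$$
S(1 + H_{01}) = \frac{M_1 - M_2\tilde{H}_1}{1 - \tilde{H}_1\tilde{H}_2}. \qquad \text{Eq. 12}
$$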

Equation 12 is the same as Equation 8, with the replacement of $H_0$ by $\tilde{H}_2$, and the
addition of the $(1 + H_{01})$ factor on the left side. This extra factor $(1 + H_{01})$ means that S
cannot be solved for directly in this situation, but a solution can be generated for the
signal plus the addition of all of its echoes. This is not such a bad situation, as there are
many conventional methods for dealing with echo suppression, and even if the echoes are
not suppressed, it is unlikely that they will affect the comprehensibility of the speech to
any meaningful extent. The more complex calculation of $\tilde{H}_2$ is needed to account for the
signal echoes in MIC 2, which act as noise sources.
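The general-case result can be verified bin by bin. The sketch below is only a numerical check of the algebra under stated assumptions: random complex numbers stand in for the spectra and paths at each frequency bin, the sizes and scale factors are arbitrary, and the same noise mix is present when $\tilde{H}_1$ is formed and when the signal is recovered (the combined transfer function depends on the noise sources, so a changed mix would require recalculation).

```python
import numpy as np

rng = np.random.default_rng(1)
bins, n_src = 64, 3          # illustrative sizes

def crandn(*shape):
    """Random complex values standing in for spectra at each frequency bin."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

S = crandn(bins)                         # desired signal S(z)
N = crandn(n_src, bins)                  # noise sources N_i(z)
H = 0.3 * crandn(n_src, bins)            # noise paths to MIC 1, H_i(z)
G = crandn(n_src, bins)                  # noise paths to MIC 2, G_i(z)
H00, H01, H02 = (0.2 * crandn(bins) for _ in range(3))   # signal paths

noise1 = (N * H).sum(axis=0)             # total noise reaching MIC 1
noise2 = (N * G).sum(axis=0)             # total noise reaching MIC 2

# Equation 9: both microphones with signal, signal echoes, and noise.
M1 = S + S * H01 + noise1
M2 = S * H00 + S * H02 + noise2

# Equation 6 (VAD = 0) and Equation 10 (speech only).
H1_t = noise1 / noise2
H2_t = (H00 + H02) / (1.0 + H01)

# Equation 12: the recoverable quantity is S(1 + H01), not S itself.
recovered = (M1 - M2 * H1_t) / (1.0 - H1_t * H2_t)
target = S * (1.0 + H01)
```

Setting `H01` and `H02` to zero in this sketch reduces it to the multiple-noise-source case of Equation 8, where `recovered` equals `S` directly.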

Figure 5 is a flow diagram 500 of a denoising algorithm, under an embodiment.
In operation, the acoustic signals are received, at block 502. Further, physiological
information associated with human voicing activity is received, at block 504. A first
transfer function representative of the acoustic signal is calculated upon determining that
voicing information is absent from the acoustic signal for at least one specified period of
time, at block 506. A second transfer function representative of the acoustic signal is
calculated upon determining that voicing information is present in the acoustic signal for
at least one specified period of time, at block 508. Noise is removed from the acoustic
signal using at least one combination of the first transfer function and the second transfer
function, producing denoised acoustic data streams, at block 510.

An algorithm for noise removal, or denoising algorithm, is described herein, from
the simplest case of a single noise source with a direct path to multiple noise sources with
reflections and echoes. The algorithm has been shown herein to be viable under any
environmental conditions. The type and amount of noise are inconsequential if a good
estimate has been made of $\tilde{H}_1$ and $\tilde{H}_2$, and if one does not change substantially while the
other is calculated. If the user environment is such that echoes are present, they can be
compensated for if coming from a noise source. If signal echoes are also present, they
will affect the cleaned signal, but the effect should be negligible in most environments.

In operation, the algorithm of an embodiment has shown excellent results in
dealing with a variety of noise types, amplitudes, and orientations. However, there are
always approximations and adjustments that have to be made when moving from
mathematical concepts to engineering applications. One assumption is made in Equation
3, where $H_2(z)$ is assumed small and therefore $H_2(z)H_1(z) \approx 0$, so that Equation 3 reduces
to

$$
S(z) \approx M_1(z) - M_2(z)H_1(z).
$$

This means that only $H_1(z)$ has to be calculated, speeding up the process and reducing the
number of computations required considerably. With the proper selection of
microphones, this approximation is easily realized.

Another approximation involves the filter used in an embodiment. The actual
$H_1(z)$ will undoubtedly have both poles and zeros, but for stability and simplicity an
all-zero Finite Impulse Response (FIR) filter is used. With enough taps the approximation to
the actual $H_1(z)$ can be very good.
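As an illustration of an all-zero approximation, the sketch below estimates FIR taps for $H_1(z)$ from noise-only data with a plain Least-Mean Squares (LMS) update, the adaptive method this description names for that purpose, and then applies the simplified form $S(z) \approx M_1(z) - M_2(z)H_1(z)$. The tap count, step size `mu`, true path `h1_true`, and test signals are all assumptions chosen for the example, and $H_2(z)$ is taken to be negligible.

```python
import numpy as np

def lms_identify(x, d, n_taps, mu):
    """Estimate FIR taps w such that x filtered by w approximates d, using LMS.

    x: reference input (MIC 2 during noise-only periods)
    d: desired output (MIC 1 during the same periods)
    """
    w = np.zeros(n_taps)
    buf = np.zeros(n_taps)               # most recent inputs, newest first
    for xk, dk in zip(x, d):
        buf = np.roll(buf, 1)
        buf[0] = xk
        err = dk - w @ buf               # prediction error for this sample
        w += mu * err * buf              # LMS coefficient update
    return w

rng = np.random.default_rng(3)
h1_true = np.array([0.5, 0.2, -0.1])     # hypothetical noise path to MIC 1

# Noise-only training data (VAD = 0): m2 = n, m1 = n filtered by h1.
n_train = rng.standard_normal(8000)
m1n = np.convolve(n_train, h1_true)[:len(n_train)]
h1_est = lms_identify(n_train, m1n, n_taps=8, mu=0.01)

# Denoise with the simplified form (H2 assumed negligible).
s = np.sin(2 * np.pi * 200 * np.arange(800) / 8000)
n = rng.standard_normal(800)
m1 = s + np.convolve(n, h1_true)[:len(n)]
m2 = n
s_hat = m1 - np.convolve(m2, h1_est)[:len(m2)]
```

Using more taps than the true path needs, as here, is harmless in this noiseless setting because the extra coefficients converge toward zero; in a real system the tap count trades accuracy against computation.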

To further increase the performance of the noise suppression system, the spectrum
of interest (generally about 125 to 3700 Hz) is divided into subbands. The wider the
range of frequencies over which a transfer function must be calculated, the more difficult
it is to calculate it accurately. Therefore the acoustic data was divided into 16 subbands,
and the denoising algorithm was then applied to each subband in turn. Finally, the 16
denoised data streams were recombined to yield the denoised acoustic data. This works
very well, but any combination of subbands (e.g., 4, 6, 8, 32, equally spaced,
perceptually spaced, etc.) can be used and all have been found to work better than a single
subband.
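One simple way to picture the split-process-recombine structure is with FFT-bin masks. The sketch below is an assumption-laden illustration rather than the filter bank of any embodiment: it uses 16 equally spaced, non-overlapping bands over 125 to 3700 Hz, an 8 kHz sampling rate, a single channel, and a placeholder `denoise_band` callable standing in for the per-band denoiser.

```python
import numpy as np

FS = 8000                    # sampling rate in Hz (the data described later is decimated to 8 kHz)
F_LO, F_HI = 125.0, 3700.0   # spectrum of interest from the text
N_BANDS = 16

def split_subbands(x, fs=FS, n_bands=N_BANDS, f_lo=F_LO, f_hi=F_HI):
    """Return an (n_bands, len(x)) array of equally spaced, non-overlapping subbands."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    edges = np.linspace(f_lo, f_hi, n_bands + 1)
    bands = []
    for k in range(n_bands):
        # Half-open intervals so every FFT bin falls into at most one band.
        mask = (freqs >= edges[k]) & (freqs < edges[k + 1])
        bands.append(np.fft.irfft(X * mask, n=len(x)))
    return np.array(bands)

def denoise_by_subband(x, denoise_band):
    """Apply denoise_band (a placeholder callable) to each subband and recombine by summation."""
    return sum(denoise_band(b) for b in split_subbands(x))

x = np.random.default_rng(2).standard_normal(800)   # 100 ms of test data at 8 kHz
y = denoise_by_subband(x, lambda band: band)        # identity stands in for the denoiser
```

With the identity placeholder, the recombined output equals the input band-limited to 125 to 3700 Hz, which confirms that the split and recombination steps themselves add no distortion inside the band of interest.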

The amplitude of the noise was constrained in an embodiment so that the 
microphones used did not saturate (that is, operate outside a linear response region). It is 
important that the microphones operate linearly to ensure the best performance. Even 
with this restriction, very low signal-to-noise ratio (SNR) signals can be denoised (down
to -10 dB or less).

The calculation of $H_1(z)$ is accomplished every 10 milliseconds using the
Least-Mean Squares (LMS) method, a common adaptive technique for estimating transfer
functions. An explanation may be found in "Adaptive Signal Processing" (1985), by Widrow
and Stearns, published by Prentice-Hall, ISBN 0-13-004029-0. The LMS was used for
demonstration purposes, but many other system identification techniques can be used to
identify $H_1(z)$ and $H_2(z)$ in Figure 2.

The VAD for an embodiment is derived from a radio frequency sensor and the
two microphones, yielding very high accuracy (>99%) for both voiced and unvoiced
speech. The VAD of an embodiment uses a radio frequency (RF) vibration detector
interferometer to detect tissue motion associated with human speech production, but is
not so limited. The signal from the RF device is completely acoustic-noise free, and is
able to function in any acoustic noise environment. A simple energy measurement of the
RF signal can be used to determine if voiced speech is occurring. Unvoiced speech can
be determined using conventional acoustic-based methods, by proximity to voiced
sections determined using the RF sensor or similar voicing sensors, or through a
combination of the above. Since there is much less energy in unvoiced speech, its
detection accuracy is not as critical to good noise suppression performance as is voiced
speech.

With voiced and unvoiced speech detected reliably, the algorithm of an
embodiment can be implemented. Once again, it is useful to repeat that the noise
removal algorithm does not depend on how the VAD is obtained, only that it is accurate,
especially for voiced speech. If speech is not detected and training occurs on the speech,
the subsequent denoised acoustic data can be distorted.

Data were collected in four channels, one for MIC 1, one for MIC 2, and two for
the radio frequency sensor that detected the tissue motions associated with voiced speech.
The data were sampled simultaneously at 40 kHz, then digitally filtered and decimated
down to 8 kHz. The high sampling rate was used to reduce any aliasing that might result
from the analog to digital process. A four-channel National Instruments A/D board was
used along with LabVIEW to capture and store the data. The data were then read into a C
program and denoised 10 milliseconds at a time.
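The filter-then-decimate step can be sketched as follows. The filter design here (a 101-tap Hamming-windowed sinc with a 3600 Hz cutoff) is an assumption made for illustration; the description does not specify the digital filter that was used.

```python
import numpy as np

FS_IN, FS_OUT = 40_000, 8_000
FACTOR = FS_IN // FS_OUT                 # decimate by 5

def lowpass_taps(cutoff_hz, fs, n_taps=101):
    """Windowed-sinc FIR low-pass filter (Hamming window), normalized to unit DC gain."""
    k = np.arange(n_taps) - (n_taps - 1) / 2
    taps = np.sinc(2 * cutoff_hz / fs * k) * np.hamming(n_taps)
    return taps / taps.sum()

def decimate(x, factor=FACTOR, fs=FS_IN):
    """Low-pass below the new Nyquist frequency, then keep every factor-th sample."""
    taps = lowpass_taps(cutoff_hz=0.45 * fs / factor, fs=fs)
    filtered = np.convolve(x, taps, mode="same")
    return filtered[::factor]

t = np.arange(FS_IN) / FS_IN                     # one second at 40 kHz
tone_in_band = np.sin(2 * np.pi * 1000 * t)      # should be kept
tone_aliasing = np.sin(2 * np.pi * 7000 * t)     # would alias to 1 kHz if not filtered out
y = decimate(tone_in_band + tone_aliasing)
```

Without the low-pass stage, the 7 kHz component would fold onto 1 kHz at the 8 kHz output rate and become indistinguishable from in-band content, which is why the filtering precedes the sample dropping.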

Figure 6 shows a denoised audio signal 602 output upon application of the noise
suppression algorithm of an embodiment to a dirty acoustic signal 604, under an
embodiment. The dirty acoustic signal 604 includes speech of an American
English-speaking female in the presence of airport terminal noise where the noise includes many
other human speakers and public announcements. The speaker is uttering the numbers
"406 5562" in the midst of moderate airport terminal noise. The dirty acoustic signal 604
was denoised 10 milliseconds at a time, and before denoising the 10 milliseconds of data
were prefiltered from 50 to 3700 Hz. A reduction in the noise of approximately 17 dB is
evident. No post filtering was done on this sample; thus, all of the noise reduction
realized is due to the algorithm of an embodiment. It is clear that the algorithm adjusts to
the noise instantly, and is capable of removing the very difficult noise of other human
speakers. Many different types of noise have all been tested with similar results,
including street noise, helicopters, music, and sine waves. Also, the orientation of the
noise can be varied substantially without significantly changing the noise suppression
performance. Finally, the distortion of the cleaned speech is very low, ensuring good
performance for speech recognition engines and human receivers alike.

The noise removal algorithm of an embodiment has been shown to be viable
under any environmental conditions. The type and amount of noise are inconsequential if
a good estimate has been made of $\tilde{H}_1$ and $\tilde{H}_2$. If the user environment is such that
echoes are present, they can be compensated for if coming from a noise source. If signal
echoes are also present, they will affect the cleaned signal, but the effect should be
negligible in most environments.

When using the VAD devices and methods described herein with a noise
suppression system, the VAD signal is processed independently of the noise suppression
system, so that the receipt and processing of VAD information is independent from the
processing associated with the noise suppression, but the embodiments are not so limited.
This independence is attained physically (i.e., different hardware for use in receiving and
processing signals relating to the VAD and the noise suppression), but is not so limited.

The VAD devices/methods described herein generally include vibration and
movement sensors, but are not so limited. In one embodiment, an accelerometer is placed
on the skin for use in detecting skin surface vibrations that correlate with human speech.
These recorded vibrations are then used to calculate a VAD signal for use with or by an
adaptive noise suppression algorithm in suppressing environmental acoustic noise from a
simultaneously (within a few milliseconds) recorded acoustic signal that includes both
speech and noise.

Another embodiment of the VAD devices/methods described herein includes an
acoustic microphone modified with a membrane so that the microphone no longer
efficiently detects acoustic vibrations in air. The membrane, though, allows the
microphone to detect acoustic vibrations in objects with which it is in physical contact
(allowing a good mechanical impedance match), such as human skin. That is, the
acoustic microphone is modified in some way such that it no longer detects acoustic
vibrations in air (where it no longer has a good physical impedance match), but only in
objects with which the microphone is in contact. This configures the microphone, like
the accelerometer, to detect vibrations of human skin associated with the speech
production of that human while not efficiently detecting acoustic environmental noise in
the air. The detected vibrations are processed to form a VAD signal for use in a noise
suppression system, as detailed below.

Yet another embodiment of the VAD described herein uses an electromagnetic
vibration sensor, such as a radio frequency (RF) vibrometer or laser vibrometer, which
detects skin vibrations. Further, the RF vibrometer detects the movement of tissue within
the body, such as the inner surface of the cheek or the tracheal wall. Both the exterior
skin and internal tissue vibrations associated with speech production can be used to form
a VAD signal for use in a noise suppression system as detailed below.

Figure 7A is a block diagram of a VAD system 702A including hardware for use
in receiving and processing signals relating to VAD, under an embodiment. The VAD
system 702A includes a VAD device 730 coupled to provide data to a corresponding
VAD algorithm 740. Note that noise suppression systems of alternative embodiments
can integrate some or all functions of the VAD algorithm with the noise suppression
processing in any manner obvious to those skilled in the art. Referring to Figure 1, the
voicing sensors 20 include the VAD system 702A, for example, but are not so limited.
Referring to Figure 2, the VAD includes the VAD system 702A, for example, but is not
so limited.

Figure 7B is a block diagram of a VAD system 702B using hardware of the
associated noise suppression system 701 for use in receiving VAD information 764,
under an embodiment. The VAD system 702B includes a VAD algorithm 750 that
receives data 764 from MIC 1 and MIC 2, or other components, of the corresponding
signal processing system 700. Alternative embodiments of the noise suppression system
can integrate some or all functions of the VAD algorithm with the noise suppression
processing in any manner obvious to those skilled in the art.

The vibration/movement-based VAD devices described herein include the
physical hardware devices for use in receiving and processing signals relating to the VAD
and the noise suppression. As a speaker or user produces speech, the resulting vibrations
propagate through the tissue of the speaker and therefore can be detected on and beneath
the skin using various methods. These vibrations are an excellent source of VAD
information, as they are strongly associated with both voiced and unvoiced speech
(although the unvoiced speech vibrations are much weaker and more difficult to detect)
and generally are only slightly affected by environmental acoustic noise (some
devices/methods, for example the electromagnetic vibrometers described below, are not
affected by environmental acoustic noise). These tissue vibrations or movements are
detected using a number of VAD devices including, for example, accelerometer-based
devices, skin surface microphone (SSM) devices, and electromagnetic (EM) vibrometer
devices including both radio frequency (RF) vibrometers and laser vibrometers.

Accelerometer-based VAD Devices/Methods 

Accelerometers can detect skin vibrations associated with speech. As such, and 
with reference to Figure 2 and Figure 7A, a VAD system 702A of an embodiment
includes an accelerometer-based device 730 providing data of the skin vibrations to an
associated algorithm 740. The algorithm 740 of an embodiment uses energy calculation 
techniques along with a threshold comparison, as described herein, but is not so limited. 
Note that more complex energy-based methods are available to those skilled in the art. 

Figure 8 is a flow diagram 800 of a method for determining voiced and unvoiced
speech using an accelerometer-based VAD, under an embodiment. Generally, the energy
is calculated by defining a standard window size over which the calculation is to take
place and summing the square of the amplitude over time as

$$\text{Energy} = \sum_i x_i^2,$$

where i is the digital sample subscript and ranges from the beginning of the window to
the end of the window.
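
As an illustration only, the energy expression above can be sketched in a few lines of
Python. The function name window_energy and the sample values are hypothetical and do
not appear in the embodiments; the sketch simply sums the squared amplitudes of the
samples in one window.

```python
from typing import Sequence


def window_energy(samples: Sequence[float]) -> float:
    """Return the sum of squared amplitudes over one analysis window.

    Hypothetical helper: `samples` holds the digitized sensor values x_i
    for every index i from the start to the end of the window.
    """
    return sum(x * x for x in samples)


if __name__ == "__main__":
    # Illustrative four-sample window: 1 + 4 + 0.25 + 0 = 5.25
    print(window_energy([1.0, -2.0, 0.5, 0.0]))
```

The sign of each sample does not matter, because every amplitude is squared before
summation.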

Referring to Figure 8, operation begins upon receiving accelerometer data, at
block 802. The processing associated with the VAD includes filtering the data from the
accelerometer to preclude aliasing, and digitizing the filtered data for processing, at block
804. The digitized data is segmented into windows 20 milliseconds (msec) in length, and
the data is stepped 8 msec at a time, at block 806. The processing further includes
filtering the windowed data, at block 808, to remove spectral information that is
corrupted by noise or is otherwise unwanted. The energy in each window is calculated by
summing the squares of the amplitudes as described above, at block 810. The calculated
energy values can be normalized by dividing the energy values by the window length;
however, this involves an extra calculation and is not needed as long as the window
length is not varied.
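
The windowing and energy steps of blocks 806 and 810 might be sketched as follows. This
is a minimal sketch under stated assumptions, not the implementation of the embodiment:
the function name frame_energies, the 8000 Hz sample rate used in the example, and the
normalize flag are all hypothetical, and the filtering of blocks 804 and 808 is omitted.

```python
from typing import List, Sequence


def frame_energies(samples: Sequence[float], sample_rate_hz: int,
                   window_ms: float = 20.0, step_ms: float = 8.0,
                   normalize: bool = False) -> List[float]:
    """Segment samples into overlapping windows and return each window's energy.

    Defaults follow the 20 msec window and 8 msec step named in the text.
    The sample rate is an assumption supplied by the caller.
    """
    window_len = int(round(sample_rate_hz * window_ms / 1000.0))
    step_len = int(round(sample_rate_hz * step_ms / 1000.0))
    if window_len <= 0 or step_len <= 0:
        raise ValueError("window and step must each span at least one sample")

    energies: List[float] = []
    # Only full windows are processed; a trailing partial window is dropped.
    for start in range(0, len(samples) - window_len + 1, step_len):
        window = samples[start:start + window_len]
        energy = sum(x * x for x in window)
        if normalize:
            # Optional: divide by the window length, as the text describes.
            energy /= window_len
        energies.append(energy)
    return energies


if __name__ == "__main__":
    # At an assumed 8000 Hz, a window is 160 samples and a step is 64 samples.
    constant_signal = [1.0] * 480
    print(frame_energies(constant_signal, 8000))
    print(frame_energies(constant_signal, 8000, normalize=True))
```

Because the window length is fixed, the normalized and unnormalized values differ only
by a constant factor, which is why the normalization can be skipped when a fixed
threshold is used.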

The calculated, or normalized, energy values are compared to a threshold, at block
812. The speech corresponding to the accelerometer data is designated as voiced speech
when the energy of the accelerometer data is at or above a threshold value, at block 814.
Likewise, the speech corresponding to the accelerometer data is designated as unvoiced
speech when the energy of the accelerometer data is below the threshold value, at block
816. Noise suppression systems of alternative embodiments can use multiple threshold
values to indicate the relative strength or confidence of the voicing signal, but are not so
limited. Multiple subbands may also be processed for increased accuracy.
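
The threshold comparison of blocks 812 through 816, together with the multiple-threshold
variant, might be sketched as below. The function names and all threshold values are
hypothetical and chosen only for illustration; the source does not specify how
thresholds are selected or how many confidence levels are used.

```python
from typing import List, Sequence


def classify_frames(energies: Sequence[float], threshold: float) -> List[bool]:
    """Return True (voiced) for each energy at or above the threshold,
    and False (unvoiced) for each energy below it."""
    return [energy >= threshold for energy in energies]


def confidence_level(energy: float, thresholds: Sequence[float]) -> int:
    """Return how many of the thresholds the energy meets or exceeds.

    Assumed scheme: 0 means below every threshold, and larger values
    indicate a stronger voicing signal.
    """
    return sum(1 for t in thresholds if energy >= t)


if __name__ == "__main__":
    energies = [0.5, 2.0, 1.0]  # illustrative energy values
    print(classify_frames(energies, threshold=1.0))
    print(confidence_level(5.0, thresholds=[1.0, 4.0, 10.0]))
```

An energy exactly equal to the threshold is treated as voiced, matching the "at or
above" wording of block 814.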

Figure 9 shows plots including a noisy audio signal (live recording) 902 along
with a corresponding accelerometer-based VAD signal 904, the corresponding
accelerometer output signal 912, and the denoised audio signal 922 following processing
by the noise suppression system using the VAD signal 904, under an embodiment. The
noise suppression system of this embodiment includes an accelerometer (Model 352A24)
from PCB Piezotronics, but is not so limited. In this example, the accelerometer data has
been bandpass filtered between 500 and 2500 Hz to remove unwanted acoustic noise that
can couple to the accelerometer below 500 Hz. The audio signal 902 was recorded using
a microphone set and standard accelerometer in a babble noise environment inside a
chamber measuring six (6) feet on a side and having a ceiling height of eight (8) feet.
The microphone set, for example, is available from Aliph, Brisbane, California. The
noise suppression system is implemented in real-time, with a delay of approximately 10
msec. The difference between the raw audio signal 902 and the denoised audio signal 922
shows noise suppression approximately in the range of 25-30 dB with little distortion of
the desired speech signal. Thus, denoising using the accelerometer-based VAD
information is very effective.
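
The source does not state which filter design was used for the 500 to 2500 Hz bandpass.
As one possible sketch, a single second-order (biquad) bandpass section with
approximately that passband could look like the following. The biquad design, the
function name bandpass_biquad, and the 8000 Hz sample rate in the example are
assumptions made for illustration; a practical system would likely use a
higher-order filter with sharper band edges.

```python
import math
from typing import List, Sequence


def bandpass_biquad(samples: Sequence[float], sample_rate_hz: float,
                    low_hz: float = 500.0, high_hz: float = 2500.0) -> List[float]:
    """Filter samples with one second-order bandpass section.

    Assumed design: unity gain at the center frequency, which is taken
    as the geometric mean of the band edges. The sample rate must be
    more than twice the center frequency.
    """
    center_hz = math.sqrt(low_hz * high_hz)
    q = center_hz / (high_hz - low_hz)
    w0 = 2.0 * math.pi * center_hz / sample_rate_hz
    alpha = math.sin(w0) / (2.0 * q)

    # Coefficients normalized so that the leading feedback coefficient is 1.
    a0 = 1.0 + alpha
    b0, b1, b2 = alpha / a0, 0.0, -alpha / a0
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0

    out: List[float] = []
    x1 = x2 = y1 = y2 = 0.0  # previous inputs and outputs
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


if __name__ == "__main__":
    fs = 8000  # assumed sample rate
    low_tone = [math.sin(2 * math.pi * 100 * n / fs) for n in range(fs)]
    mid_tone = [math.sin(2 * math.pi * 1118 * n / fs) for n in range(fs)]
    for name, tone in (("100 Hz", low_tone), ("1118 Hz", mid_tone)):
        filtered = bandpass_biquad(tone, fs)
        print(name, sum(y * y for y in filtered))
```

In this sketch a tone well below 500 Hz emerges with far less energy than a tone near
the center of the band, which is the behavior the bandpass step relies on.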

Skin Surface Microphone (SSM) VAD Devices/Methods

Referring again to Figure 2 and Figure 7A, a VAD system 702A of an
embodiment includes an SSM VAD device 730 providing data to an associated algorithm
740. The SSM is a conventional microphone modified to prevent airborne acoustic
information from coupling with the microphone's detecting elements. A layer of silicone
or other covering changes the impedance of the microphone and prevents airborne
acoustic information from being detected to a significant degree. Thus this microphone is
shielded from airborne acoustic energy but is able to detect acoustic waves traveling in
media other than air as long as it maintains physical contact with the media. The silicone
or similar material allows the microphone to mechanically couple efficiently with the
skin of the user.

During speech, when the SSM is placed on the cheek or neck, vibrations 
associated with speech production are easily detected. However, airborne acoustic data is 
not significantly detected by the SSM. The tissue-borne acoustic signal, upon detection
by the SSM, is used to generate the VAD signal in processing and denoising the signal of
interest, as described above with reference to the energy/threshold method used with
the accelerometer-based VAD signal and Figure 8.

Figure 10 shows plots including a noisy audio signal (live recording) 1002 along
with a corresponding SSM-based VAD signal 1004, the corresponding SSM output signal
1012, and the denoised audio signal 1022 following processing by the noise suppression
system using the VAD signal 1004, under an embodiment. The audio signal 1002 was
recorded using an Aliph microphone set and standard accelerometer in a babble noise
environment inside a chamber measuring six (6) feet on a side and having a ceiling height
of eight (8) feet. The noise suppression system is implemented in real-time, with a delay
of approximately 10 msec. The difference between the raw audio signal 1002 and the
denoised audio signal 1022 clearly shows noise suppression approximately in the range of
20-25 dB with little distortion of the desired speech signal. Thus, denoising using the
SSM-based VAD information is effective.
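
The source does not describe how the suppression figures quoted above were measured.
One common way to express such a figure, shown here only as an assumed sketch, is the
ratio in decibels of the energy of a noise-only segment before processing to the energy
of the same segment after processing. The function name suppression_db and the sample
values are hypothetical.

```python
import math
from typing import Sequence


def suppression_db(raw: Sequence[float], denoised: Sequence[float]) -> float:
    """Return 10*log10 of the raw-to-denoised energy ratio.

    Assumes both sequences cover the same noise-only segment, so that
    the ratio reflects noise reduction rather than speech content.
    """
    raw_energy = sum(x * x for x in raw)
    denoised_energy = sum(x * x for x in denoised)
    if raw_energy <= 0.0 or denoised_energy <= 0.0:
        raise ValueError("both segments must have nonzero energy")
    return 10.0 * math.log10(raw_energy / denoised_energy)


if __name__ == "__main__":
    noise = [1.0, -1.0, 0.5, -0.5]
    # Reducing every amplitude by a factor of 10 reduces the energy by a
    # factor of 100, which corresponds to about 20 dB of suppression.
    reduced = [0.1 * x for x in noise]
    print(round(suppression_db(noise, reduced), 6))
```

On this scale, 20 dB corresponds to a hundredfold reduction in noise energy and 30 dB to
a thousandfold reduction.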



Electromagnetic (EM) Vibrometer VAD Devices/Methods

Returning to Figure 2 and Figure 7A, a VAD system 702A of an embodiment
includes an EM vibrometer VAD device 730 providing data to an associated algorithm
740. The EM vibrometer devices also detect tissue vibration, but can do so at a distance
and without direct contact of the tissue targeted for measurement. Further, some EM
vibrometer devices can detect vibrations of internal tissue of the human body. The EM
vibrometers are unaffected by acoustic noise, making them good choices for use in high
noise environments. The noise suppression system of an embodiment receives VAD
information from EM vibrometers including, but not limited to, RF vibrometers and laser
vibrometers, each of which is described in turn below.

The RF vibrometer operates in the radio to microwave portion of the
electromagnetic spectrum, and is capable of measuring the relative motion of internal
human tissue associated with speech production. The internal human tissue includes
tissue of the trachea, cheek, jaw, and/or nose/nasal passages, but is not so limited. The
RF vibrometer senses movement using low-power radio waves, and data from these
devices has been shown to correspond very well with calibrated targets. As a result of the
absence of acoustic noise in the RF vibrometer signal, the VAD system of an
embodiment uses signals from these devices to construct a VAD using the
energy/threshold method described above with reference to the accelerometer-based VAD
and Figure 8.

An example of an RF vibrometer is the General Electromagnetic Motion Sensor 
(GEMS) radiovibrometer available from Aliph, located in Brisbane, California. Other RF
vibrometers are described in the Related Applications and by Gregory C. Burnett in "The
Physiological Basis of Glottal Electromagnetic Micropower Sensors (GEMS) and Their 
Use in Defining an Excitation Function for the Human Vocal Tract", Ph.D. Thesis, 
University of California Davis, January 1999. 

Laser vibrometers operate at or near the visible frequencies of light, and are
therefore restricted to surface vibration detection only, similar to the accelerometer and
the SSM described above. Like the RF vibrometer, there is no acoustic noise associated 
with the signal of the laser vibrometers. Therefore, the VAD system of an embodiment 
uses signals from these devices to construct a VAD using the energy/threshold method 
described above with reference to the accelerometer-based VAD and Figure 8. 

Figure 11 shows plots including a noisy audio signal (live recording) 1102 along
with a corresponding GEMS-based VAD signal 1104, the corresponding GEMS output
signal 1112, and the denoised audio signal 1122 following processing by the noise
suppression system using the VAD signal 1104, under an embodiment. The GEMS-based
VAD signal 1104 was received from a trachea-mounted GEMS radiovibrometer
from Aliph, Brisbane, California. The audio signal 1102 was recorded using an Aliph
microphone set in a babble noise environment inside a chamber measuring six (6) feet on
a side and having a ceiling height of eight (8) feet. The noise suppression system is
implemented in real-time, with a delay of approximately 10 msec. The difference between
the raw audio signal 1102 and the denoised audio signal 1122 clearly shows noise suppression
approximately in the range of 20-25 dB with little distortion of the desired speech signal.
Thus, denoising using the GEMS-based VAD information is effective. It is clear that
both the VAD signal and the denoising are effective, even though the GEMS is not
detecting unvoiced speech. Unvoiced speech is normally low enough in energy that it
does not significantly affect the convergence of H1(z) and therefore the quality of the
denoised speech.

Aspects of the noise suppression system may be implemented as functionality
programmed into any of a variety of circuitry, including programmable logic devices
(PLDs), such as field programmable gate arrays (FPGAs), programmable array logic
(PAL) devices, electrically programmable logic and memory devices and standard
cell-based devices, as well as application specific integrated circuits (ASICs). Some other
possibilities for implementing aspects of the noise suppression system include:
microcontrollers with memory (such as electronically erasable programmable read only
memory (EEPROM)), embedded microprocessors, firmware, software, etc. If aspects of
the noise suppression system are embodied as software at at least one stage during
manufacturing (e.g. before being embedded in firmware or in a PLD), the software may
be carried by any computer readable medium, such as magnetically- or optically-readable
disks (fixed or floppy), modulated on a carrier signal or otherwise transmitted, etc.

Furthermore, aspects of the noise suppression system may be embodied in
microprocessors having software-based circuit emulation, discrete logic (sequential and
combinatorial), custom devices, fuzzy (neural) logic, quantum devices, and hybrids of
any of the above device types. Of course the underlying device technologies may be
provided in a variety of component types, e.g., metal-oxide semiconductor field-effect
transistor (MOSFET) technologies like complementary metal-oxide semiconductor
(CMOS), bipolar technologies like emitter-coupled logic (ECL), polymer technologies
(e.g., silicon-conjugated polymer and metal-conjugated polymer-metal structures), mixed
analog and digital, etc.

Unless the context clearly requires otherwise, throughout the description and the
claims, the words "comprise," "comprising," and the like are to be construed in an
inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in a sense of
"including, but not limited to." Words using the singular or plural number also include
the plural or singular number respectively. Additionally, the words "herein,"
"hereunder," "above," "below," and words of similar import, when used in this
application, shall refer to this application as a whole and not to any particular portions of
this application. When the word "or" is used in reference to a list of two or more items,
that word covers all of the following interpretations of the word: any of the items in the
list, all of the items in the list and any combination of the items in the list.

The above descriptions of embodiments of the noise suppression system are not
intended to be exhaustive or to limit the noise suppression system to the precise forms
disclosed. While specific embodiments of, and examples for, the noise suppression
system are described herein for illustrative purposes, various equivalent modifications are
possible within the scope of the noise suppression system, as those skilled in the relevant
art will recognize. The teachings of the noise suppression system provided herein can be
applied to other processing systems and communication systems, not only for the
processing systems described above.

The elements and acts of the various embodiments described above can be 
combined to provide further embodiments. These and other changes can be made to the 
noise suppression system in light of the above detailed description. 

All of the above references and United States patent applications are incorporated
herein by reference. Aspects of the noise suppression system can be modified, if
necessary, to employ the systems, functions and concepts of the various patents and 
applications described above to provide yet further embodiments of the noise suppression 
system. 

In general, in the following claims, the terms used should not be construed to limit
the noise suppression system to the specific embodiments disclosed in the specification
and the claims, but should be construed to include all processing systems that operate
under the claims to provide a method for compressing and decompressing data files or
streams. Accordingly, the noise suppression system is not limited by the disclosure, but
instead the scope of the noise suppression system is to be determined entirely by the
claims.

While certain aspects of the noise suppression system are presented below in 
certain claim forms, the inventors contemplate the various aspects of the noise 
suppression system in any number of claim forms. For example, while only one aspect of 
the noise suppression system is recited as embodied in computer-readable medium, other 
aspects may likewise be embodied in computer-readable medium. Accordingly, the 
inventors reserve the right to add additional claims after filing the application to pursue 
such additional claim forms for other aspects of the noise suppression system.